Hands-on Workshop Program: Government Data Scientist


Hands-on Workshop Program: Government Data Scientist (Day 2). Government Digital Personnel Development Institute, Digital Government Development Agency (Public Organization). Lectured by Asst. Prof. Dr. Sotarat Thammaboosadee, Lecturer in Information Technology Management, Faculty of Engineering, Mahidol University.

2 Keep in touch Materials https://goo.gl/kavknv

3 Agenda Day 2 09.00-09.30 [M07] Introduction to Machine Learning 09.30-10.30 [M08] Basic Classification Techniques 10.45-11.30 [M09] Classification Evaluation Techniques 11.30-12.00 [M10] Regression Methods 13.00-14.30 [M11] Advanced Classification Techniques 14.45-16.00 [M12] Model Integration and Results Optimization

4 Agenda Day 3 09.00-10.30 [M13] Advanced Data Preprocessing 10.45-12.00 [M14] Clustering Techniques 13.00-14.30 [M15] Semi-supervised Learning 14.45-16.00 [M16] Association Rules Discovery

5 Agenda Day 4 09.00-09.30 [M17] Introduction to Data Visualization and Business Intelligence 09.30-10.30 [M18] Data Blending and Data Integration for Visualization 10.45-12.00 [M19] Customizing Data Display 13.00-14.30 [M20] Creating and Applying Predefined Functions 14.45-16.00 [M21] Visual Analytics, Dashboard Design, and Storytelling

6 Agenda Day 5 (Pitching Day) 09.00-11.30 Project Presentation 11.30-12.00 Champion Announcement

7 Module 7 Introduction to Machine Learning

8 What is machine learning? Machine learning is the subfield of computer science that, according to Arthur Samuel in 1959, gives computers "the ability to learn without being explicitly programmed." Evolved from the study of pattern recognition and computational learning theory in artificial intelligence, machine learning explores the study and construction of algorithms that can learn from and make predictions on data.

9

10 How Machine Learning Works

11 Module 8 Basic Classification Techniques

12 Classification: Definition. Given a collection of records (training set), where each record contains a set of attributes and one of the attributes is the class, find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate it.
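A minimal sketch of this workflow in Python with scikit-learn (an assumption for illustration; the workshop labs themselves use RapidMiner): split the records into training and test sets, learn a model on the former, and measure accuracy on the latter.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# A small example dataset: attribute values X, class labels y
X, y = load_iris(return_X_y=True)

# Hold out one third of the records as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Learn the model from the training set only
model = DecisionTreeClassifier().fit(X_train, y_train)

# Apply the model to previously unseen records and measure accuracy
print(accuracy_score(y_test, model.predict(X_test)))
```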

13 Illustrating Classification Task

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Induction: a learning algorithm learns a model from the training set. Deduction: the model is applied to the test set.

14 Classification Workflow (Basic)

15 Classification Techniques: Decision Tree based Methods; Memory based reasoning; Naïve Bayes and Bayesian Belief Networks; Neural Networks; etc.

16 Decision Tree

17 Decision Tree

18 Constructing decision trees Strategy: top down Recursive divide-and-conquer fashion First: select attribute for root node Create branch for each possible attribute value Then: split instances into subsets One for each branch extending from the node Finally: repeat recursively for each branch, using only instances that reach the branch Stop if all instances have the same class
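A sketch of this top-down induction in plain Python (illustrative code made runnable; select_attribute below is a hypothetical placeholder where a real learner would score attributes by an impurity measure such as those discussed in the following slides):

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def select_attribute(rows, labels, attrs):
    # Placeholder: a real learner ranks attributes by gain/Gini here
    return attrs[0]

def build_tree(rows, labels, attrs):
    """Recursive divide-and-conquer tree induction (simplified sketch)."""
    if len(set(labels)) == 1:      # stop: all instances share one class
        return labels[0]
    if not attrs:                  # no attributes left: majority class
        return majority(labels)
    best = select_attribute(rows, labels, attrs)
    tree = {best: {}}
    for v in set(r[best] for r in rows):   # one branch per value
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree[best][v] = build_tree([rows[i] for i in idx],
                                   [labels[i] for i in idx],
                                   [a for a in attrs if a != best])
    return tree

rows = [{"Refund": "Yes", "MarSt": "Single"},
        {"Refund": "No",  "MarSt": "Married"},
        {"Refund": "No",  "MarSt": "Single"}]
labels = ["No", "No", "Yes"]
print(build_tree(rows, labels, ["Refund", "MarSt"]))
```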

19 Which attribute to select?

20 Which attribute to select?

21 Example of a Decision Tree

Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes Refund, MarSt, TaxInc):
Refund = Yes -> NO
Refund = No  -> MarSt:
    Married          -> NO
    Single, Divorced -> TaxInc:
        < 80K -> NO
        > 80K -> YES

22 Another Example of Decision Tree (same training data as slide 21):
MarSt = Married          -> NO
MarSt = Single, Divorced -> Refund:
    Yes -> NO
    No  -> TaxInc:
        < 80K -> NO
        > 80K -> YES
There could be more than one tree that fits the same data!

23-27 Apply Model to Test Data
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree and follow the matching branches: Refund = No leads to the MarSt node; MarSt = Married leads to a leaf, so the record is assigned class Cheat = NO.

28 Decision Tree Classification Task: the same framework as slide 13 (training set, test set, induction, deduction), with the learning algorithm instantiated as a tree induction algorithm and the learned model being a decision tree.

29 Tree Induction. Greedy strategy: split the records based on an attribute test that optimizes a certain criterion. Issues: how to split the records (how to specify the attribute test condition? how to determine the best split?) and when to stop splitting.

30 How to Specify Test Condition? Depends on attribute types Nominal Ordinal Continuous Depends on number of ways to split 2-way split Multi-way split

31 Splitting Based on Nominal Attributes. Multi-way split: use as many partitions as distinct values, e.g. CarType -> {Family}, {Sports}, {Luxury}. Binary split: divides values into two subsets and needs to find the optimal partitioning, e.g. CarType -> {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.

32 Splitting Based on Ordinal Attributes. Multi-way split: use as many partitions as distinct values, e.g. Size -> {Small}, {Medium}, {Large}. Binary split: divides values into two subsets and needs to find the optimal partitioning, e.g. Size -> {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}. What about the split {Small, Large} vs. {Medium}? (It violates the order.)

33 Splitting Based on Continuous Attributes. Different ways of handling: Discretization to form an ordinal categorical attribute — static (discretize once at the beginning) or dynamic (ranges found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering). Binary decision: (A < v) or (A >= v) — consider all possible splits and find the best cut; can be more compute intensive.

34 Splitting Based on Continuous Attributes. (i) Binary split: Taxable Income > 80K? Yes / No. (ii) Multi-way split: Taxable Income in {< 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K}.

35 Tree Induction (recap). Greedy strategy: split the records based on an attribute test that optimizes a certain criterion. Issues: how to split the records (how to specify the attribute test condition? how to determine the best split?) and when to stop splitting.

36 How to determine the Best Split. Before splitting: 10 records of class C0 and 10 records of class C1.
Own Car?    Yes: C0 6, C1 4 | No: C0 4, C1 6
Car Type?   Family: C0 1, C1 3 | Sports: C0 8, C1 0 | Luxury: C0 1, C1 7
Student ID? c1..c10: C0 1, C1 0 each | c11..c20: C0 0, C1 1 each
Which test condition is the best?

37 How to determine the Best Split. Greedy approach: nodes with a homogeneous class distribution are preferred, so we need a measure of node impurity. Example: C0 5 / C1 5 is non-homogeneous (high degree of impurity); C0 9 / C1 1 is homogeneous (low degree of impurity).
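Two common impurity measures, computed from scratch in Python for the two example nodes (a hedged sketch; the slides do not commit to a specific measure here):

```python
import math
from collections import Counter

def gini(labels):
    """Gini index: 1 - sum_c p(c)^2; 0 for a pure node."""
    n = len(labels)
    return 1.0 - sum((cnt / n) ** 2 for cnt in Counter(labels).values())

def entropy(labels):
    """Entropy: -sum_c p(c) * log2 p(c); 0 for a pure node."""
    n = len(labels)
    return -sum((cnt / n) * math.log2(cnt / n)
                for cnt in Counter(labels).values())

node_a = ["C0"] * 5 + ["C1"] * 5   # non-homogeneous node
node_b = ["C0"] * 9 + ["C1"] * 1   # homogeneous node
print(gini(node_a), gini(node_b))        # 0.5   0.18
print(entropy(node_a), entropy(node_b))  # 1.0   ~0.47
```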

38 Tree Induction (recap). Greedy strategy: split the records based on an attribute test that optimizes a certain criterion. Issues: how to split the records (how to specify the attribute test condition? how to determine the best split?) and when to stop splitting.

39 Stopping Criteria for Tree Induction Stop expanding a node when all the records belong to the same class Stop expanding a node when all the records have similar attribute values Early termination

40 Decision Tree Based Classification Advantages: Inexpensive to construct Extremely fast at classifying unknown records Easy to interpret for small-sized trees Accuracy is comparable to other classification techniques for many simple data sets

41 Underfitting and Overfitting (Example). 500 circular and 500 triangular data points. Circular points: 0.5 <= sqrt(x1^2 + x2^2) <= 1. Triangular points: sqrt(x1^2 + x2^2) > 1 or sqrt(x1^2 + x2^2) < 0.5.

42 Underfitting and Overfitting. (Figure: training and test error vs. model complexity.) Underfitting: when the model is too simple, both training and test errors are large. Overfitting: when the model becomes too complex, test error rises even as training error keeps falling.

43 Overfitting due to Noise Decision boundary is distorted by noise point

44 Notes on Overfitting Overfitting results in decision trees that are more complex than necessary Training error no longer provides a good estimate of how well the tree will perform on previously unseen records Need new ways for estimating errors

45 Lab4 Decision Tree

46 Lab4. Try different settings: maximal depth, confidence, and applying pruning / pre-pruning.

47 Instance Based Classifiers Examples: Rote-learner Memorizes entire training data and performs classification only if attributes of record match one of the training examples exactly K-Nearest neighbor Uses k closest points (nearest neighbors) for performing classification

48 Nearest Neighbor Classifiers. Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck. (Figure: compute the distance from the test record to the training records, then choose the k nearest records.)

49 Nearest-Neighbor Classifiers. Requires three things: the set of stored records; a distance metric to compute the distance between records; and the value of k, the number of nearest neighbors to retrieve. To classify an unknown record: compute its distance to the training records, identify the k nearest neighbors, and use their class labels to determine the class of the unknown record (e.g., by taking a majority vote).

50 Definition of Nearest Neighbor. (Figures: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor.) The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.

51 Nearest Neighbor Classification. Compute the distance between two points, e.g. Euclidean distance: d(p, q) = sqrt( sum_i (p_i - q_i)^2 ). Determine the class from the nearest-neighbor list: take the majority vote of class labels among the k nearest neighbors, optionally weighing each vote by distance with weight factor w = 1/d^2.
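A compact sketch of these two steps in plain Python (the toy points below are hypothetical):

```python
import math
from collections import Counter

def euclidean(p, q):
    """d(p, q) = sqrt(sum_i (p_i - q_i)^2)"""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training records."""
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda rec: euclidean(rec[0], x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

X = [(1, 1), (1, 2), (5, 5), (6, 5)]
y = ["A", "A", "B", "B"]
print(knn_predict(X, y, (2, 1), k=3))  # "A": two A's among the 3 nearest
```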

52 Nearest Neighbor Classification. Choosing the value of k: if k is too small, the classifier is sensitive to noise points; if k is too large, the neighborhood may include points from other classes.

53 Nearest Neighbor Classification: scaling issues. Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes. Example: height of a person may vary from 1.5 m to 1.8 m; weight of a person may vary from 90 lb to 300 lb; income of a person may vary from $10K to $1M.

54 Nearest Neighbor Classification. k-NN classifiers are lazy learners: they do not build models explicitly, unlike eager learners such as decision tree induction and rule-based systems. Classifying unknown records is therefore relatively expensive.

55

56 Bayes's rule. Probability of event H given evidence E:
Pr[H | E] = Pr[E | H] Pr[H] / Pr[E]
A priori probability of H, Pr[H]: probability of the event before evidence is seen. A posteriori probability of H, Pr[H | E]: probability of the event after evidence is seen. (Thomas Bayes, born 1702 in London, England; died 1761 in Tunbridge Wells, Kent, England.)

57 Naïve Bayes for classification. Classification learning: what's the probability of the class given an instance? Evidence E = instance; event H = class value for the instance. Naïve assumption: the evidence splits into parts (i.e. attributes) that are independent, so
Pr[H | E] = Pr[E1 | H] Pr[E2 | H] ... Pr[En | H] Pr[H] / Pr[E]

58 Probabilities for weather data

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Counts and conditional probabilities (Yes / No):
Outlook:     Sunny 2/9, 3/5 | Overcast 4/9, 0/5 | Rainy 3/9, 2/5
Temperature: Hot 2/9, 2/5 | Mild 4/9, 2/5 | Cool 3/9, 1/5
Humidity:    High 3/9, 4/5 | Normal 6/9, 1/5
Windy:       False 6/9, 2/5 | True 3/9, 3/5
Play:        Yes 9/14 | No 5/14

59 Probabilities for weather data (probability table as on slide 58).
A new day: Outlook = Sunny, Temp. = Cool, Humidity = High, Windy = True, Play = ?
Likelihood of the two classes:
For yes: 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053
For no:  3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206
Conversion into a probability by normalization:
P(yes) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(no)  = 0.0206 / (0.0053 + 0.0206) = 0.795
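The same computation from scratch in Python (a hedged sketch of count-based Naïve Bayes for nominal attributes, without Laplace smoothing):

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Collect class counts and per-attribute value counts per class."""
    class_counts = Counter(labels)
    cond = defaultdict(Counter)  # keyed by (attribute index, class)
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, c)][v] += 1
    return class_counts, cond

def predict_nb(class_counts, cond, row):
    """Pr[H|E] proportional to Pr[H] * product_i Pr[E_i|H], normalized."""
    n = sum(class_counts.values())
    scores = {}
    for c, cc in class_counts.items():
        p = cc / n                              # prior Pr[H]
        for i, v in enumerate(row):
            p *= cond[(i, c)][v] / cc           # Pr[E_i | H]
        scores[c] = p
    total = sum(scores.values())
    return {c: p / total for c, p in scores.items()}

# Weather data: (Outlook, Temp, Humidity, Windy) -> Play
data = [("Sunny","Hot","High",False),("Sunny","Hot","High",True),
        ("Overcast","Hot","High",False),("Rainy","Mild","High",False),
        ("Rainy","Cool","Normal",False),("Rainy","Cool","Normal",True),
        ("Overcast","Cool","Normal",True),("Sunny","Mild","High",False),
        ("Sunny","Cool","Normal",False),("Rainy","Mild","Normal",False),
        ("Sunny","Mild","Normal",True),("Overcast","Mild","High",True),
        ("Overcast","Hot","Normal",False),("Rainy","Mild","High",True)]
play = ["No","No","Yes","Yes","Yes","No","Yes",
        "No","Yes","Yes","Yes","Yes","Yes","No"]

cc, cond = train_nb(data, play)
print(predict_nb(cc, cond, ("Sunny","Cool","High",True)))
# {'No': ~0.795, 'Yes': ~0.205}, matching the slide up to rounding
```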

60 Numeric attributes. Usual assumption: attributes have a normal (Gaussian) probability distribution given the class. The probability density function for the normal distribution is defined by two parameters:
Sample mean: mu = (1/n) * sum_{i=1..n} x_i
Standard deviation: sigma = sqrt( (1/(n-1)) * sum_{i=1..n} (x_i - mu)^2 )
Density function: f(x) = (1 / (sqrt(2*pi) * sigma)) * exp( -(x - mu)^2 / (2*sigma^2) )

61 Statistics for weather data. Nominal attributes keep their counts (Outlook Yes/No: Sunny 2/3, Overcast 4/0, Rainy 3/2; Windy: False 6/2, True 3/3; Play: 9/14, 5/14), while numeric attributes are summarized per class (raw value lists omitted): Temperature: mu = 73, sigma = 6.2 (Yes); mu = 75, sigma = 7.9 (No). Humidity: mu = 79, sigma = 10.2 (Yes); mu = 86, sigma = 9.7 (No).
Example density value: f(temperature = 66 | yes) = (1 / (sqrt(2*pi) * 6.2)) * exp( -(66 - 73)^2 / (2 * 6.2^2) ) = 0.0340

62 Classifying a new day. A new day: Outlook = Sunny, Temp. = 66, Humidity = 90, Windy = true, Play = ?
Likelihood of yes = 2/9 x 0.0340 x 0.0221 x 3/9 x 9/14 = 0.000036
Likelihood of no  = 3/5 x 0.0291 x 0.0380 x 3/5 x 5/14 = 0.000136
P(yes) = 0.000036 / (0.000036 + 0.000136) = 20.9%
P(no)  = 0.000136 / (0.000036 + 0.000136) = 79.1%
Missing values during training are not included in the calculation of mean and standard deviation.
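A short check of these density values in Python (per-class means and standard deviations taken from slide 61):

```python
import math

def gaussian_density(x, mu, sigma):
    """Normal density f(x), as used for numeric attributes in Naive Bayes."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * \
           math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

print(gaussian_density(66, mu=73, sigma=6.2))   # ~0.0340 (temperature | yes)
print(gaussian_density(90, mu=79, sigma=10.2))  # ~0.022 (humidity | yes)

# Likelihood of "yes" for the new day (Sunny, 66, 90, windy):
likelihood_yes = (2/9) * gaussian_density(66, 73, 6.2) \
                       * gaussian_density(90, 79, 10.2) * (3/9) * (9/14)
print(likelihood_yes)  # ~0.000036
```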

63 Naïve Bayes: discussion. Naïve Bayes works surprisingly well even when the independence assumption is clearly violated. Why? Because classification doesn't require accurate probability estimates as long as the maximum probability is assigned to the correct class. However, adding too many redundant attributes will cause problems (e.g. identical attributes). Note also that many numeric attributes are not normally distributed.

64 Lab5 Decision Tree, Naïve Bayes, and k-nearest Neighbors

65 Use this operator if you want to use the output more than once

66

67 Lab6 Decision Tree, Naïve Bayes, and k-Nearest Neighbors with Subprocesses

68 Use this operator to group some operators in one block. Use this operator to select subprocesses (like switch-case).

69

70

71

72 Module 9 Classification Evaluation Techniques

73 Issues in evaluation. Statistical reliability of estimated differences in performance (significance tests). Choice of performance measure: number of correct classifications, accuracy of probability estimates, error in numeric predictions. Costs assigned to different types of errors: many practical applications involve costs.

74 Training and testing I. Natural performance measure for classification problems: error rate. Success: an instance's class is predicted correctly. Error: an instance's class is predicted incorrectly. Error rate: proportion of errors made over the whole set of instances. Resubstitution error: error rate obtained from the training data. Resubstitution error is (hopelessly) optimistic!

75 Training and testing II Test set: independent instances that have played no part in formation of classifier Assumption: both training data and test data are representative samples of the underlying problem Test and training data may differ in nature Example: classifiers built using customer data from two different towns A and B To estimate performance of classifier from town A in completely new town, test it on data from B

76 Note on parameter tuning. It is important that the test data is not used in any way to create the classifier. Some learning schemes operate in two stages: Stage 1 builds the basic structure; Stage 2 optimizes parameter settings. The test data can't be used for parameter tuning! The proper procedure uses three sets: training data, validation data, and test data; the validation data is used to optimize parameters.

77 Making the most of the data Once evaluation is complete, all the data can be used to build the final classifier Generally, the larger the training data the better the classifier (but returns diminish) The larger the test data the more accurate the error estimate Holdout procedure: method of splitting original data into training and test set Dilemma: ideally both training set and test set should be large!

78 Holdout estimation What to do if the amount of data is limited? The holdout method reserves a certain amount for testing and uses the remainder for training Usually: one third for testing, the rest for training Problem: the samples might not be representative Example: class might be missing in the test data Advanced version uses stratification Ensures that each class is represented with approximately equal proportions in both subsets

79 Repeated holdout method. The holdout estimate can be made more reliable by repeating the process with different subsamples. In each iteration, a certain proportion is randomly selected for training (possibly with stratification). The error rates on the different iterations are averaged to yield an overall error rate. This is called the repeated holdout method. Still not optimal: the different test sets overlap. Can we prevent overlapping?

80 Cross-validation Cross-validation avoids overlapping test sets First step: split data into k subsets of equal size Second step: use each subset in turn for testing, the remainder for training Called k-fold cross-validation Often the subsets are stratified before the cross-validation is performed The error estimates are averaged to yield an overall error estimate

81 More on cross-validation. Standard method for evaluation: stratified ten-fold cross-validation. Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate, and there is also some theoretical evidence for it. Stratification reduces the estimate's variance. Even better: repeated stratified cross-validation, e.g. ten-fold cross-validation repeated ten times with the results averaged (reduces the variance).
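A sketch of stratified ten-fold cross-validation in Python with scikit-learn (an assumption for illustration; Lab7 below does the equivalent in RapidMiner):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each of the 10 folds keeps roughly the class proportions of the full data
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)

# The per-fold estimates are averaged into an overall accuracy estimate
print(scores.mean(), scores.std())
```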

82

83 Stratified 10-fold CV. (Figure: Rounds 1 to 10; in each round a different fold serves as the test set and the remaining nine folds as the training set.)

84 Leave-One-Out cross-validation Leave-One-Out: a particular form of cross-validation: Set number of folds to number of training instances I.e., for n training instances, build classifier n times Makes best use of the data Involves no random subsampling Very computationally expensive (exception: NN)

85

86 Metrics for Performance Evaluation. Confusion matrix:

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL   Class=Yes      TP          FN
CLASS    Class=No       FP          TN

TP = True Positive; TN = True Negative; FP = False Positive (Type I error); FN = False Negative (Type II error)

87 Metrics for Performance Evaluation (confusion matrix as above). Accuracy = (all correctly classified) / (all instances) = (TP + TN) / (TP + TN + FP + FN)

88 Limitation of Accuracy Consider a 2-class problem Number of Class 0 examples = 9990 Number of Class 1 examples = 10 If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 % Accuracy is misleading because model does not detect any class 1 example

89 Cost Matrix

                        PREDICTED CLASS
C(i|j)                  Class=Yes    Class=No
ACTUAL   Class=Yes      C(Yes|Yes)   C(No|Yes)
CLASS    Class=No       C(Yes|No)    C(No|No)

C(i|j): cost of misclassifying a class j example as class i

90 Computing Cost of Classification. Cost matrix: C(+|+) = -1, C(-|+) = 100, C(+|-) = 1, C(-|-) = 0.

Model M1 confusion matrix:            Model M2 confusion matrix:
            PREDICTED                             PREDICTED
            +      -                              +      -
ACTUAL +    150    40                 ACTUAL +    250    45
       -    60     250                       -    5      200

Model M1: Accuracy = 80%, Cost = 3910. Model M2: Accuracy = 90%, Cost = 4255.

91 Metrics for Performance Evaluation (confusion matrix as above). True Positive Rate (TPR) = Sensitivity = TP / (TP + FN)

92 Metrics for Performance Evaluation (confusion matrix as above). False Positive Rate (FPR) = Fall-out = FP / (FP + TN)

93 Metrics for Performance Evaluation (confusion matrix as above). False Negative Rate (FNR) = Miss Rate = FN / (FN + TP)

94 Metrics for Performance Evaluation (confusion matrix as above). True Negative Rate (TNR) = Specificity (SPC) = TN / (TN + FP)

95 Measures in information retrieval. Precision: percentage of retrieved documents that are relevant, precision = TP / (TP + FP). Recall: percentage of relevant documents that are retrieved, recall = TP / (TP + FN). Precision/recall curves have a hyperbolic shape. Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall); F-measure = (2 x recall x precision) / (recall + precision).

96 Metrics for Performance Evaluation (confusion matrix as above). Precision = TP / (TP + FP)

97 Metrics for Performance Evaluation (confusion matrix as above). Recall = TP / (TP + FN)

98 Metrics for Performance Evaluation (confusion matrix as above). F-Measure = (2 x Precision x Recall) / (Precision + Recall) = 2*TP / (2*TP + FP + FN)
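All of these confusion-matrix metrics in one small Python helper (the counts below are hypothetical, purely for illustration):

```python
def classification_metrics(tp, fn, fp, tn):
    """Module 9 metrics computed from the four confusion-matrix cells."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    recall    = tp / (tp + fn)            # TPR / sensitivity
    fpr       = fp / (fp + tn)            # fall-out
    fnr       = fn / (fn + tp)            # miss rate
    tnr       = tn / (tn + fp)            # specificity
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "FPR": fpr, "FNR": fnr, "TNR": tnr, "F": f_measure}

print(classification_metrics(tp=40, fn=10, fp=5, tn=45))
# accuracy 0.85, precision ~0.889, recall 0.8, F ~0.842
```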

99 Methods for Performance Evaluation How to obtain a reliable estimate of performance? Performance of a model may depend on other factors besides the learning algorithm: Class distribution Cost of misclassification Size of training and test sets

100 ROC curves. Stands for "receiver operating characteristic", used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel. The y axis shows the percentage of true positives in the sample rather than the absolute number; the x axis shows the percentage of false positives in the sample rather than the sample size.

101 ROC Curve (TP,FP): (0,0): declare everything to be negative class (1,1): declare everything to be positive class (1,0): ideal Diagonal line: Random guessing Below diagonal line: prediction is opposite of the true class

102 How to Construct an ROC curve. Use a classifier that produces a posterior probability P(+|A) for each test instance A. Sort the instances by P(+|A) in decreasing order, apply a threshold at each unique value of P(+|A), and count TP, FP, TN, FN at each threshold. TP rate TPR = TP / (TP + FN); FP rate FPR = FP / (FP + TN).

Instance  P(+|A)  True Class
1         0.95    +
2         0.93    +
3         0.87    -
4         0.85    -
5         0.85    -
6         0.85    +
7         0.76    -
8         0.53    +
9         0.43    -
10        0.25    +

103 How to construct an ROC curve (counts at each threshold):

Class:      +     -     +     -     -     -     +     -     +     +
Threshold:  0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP:         5     4     4     3     3     3     3     2     2     1     0
FP:         5     5     4     4     3     2     1     1     0     0     0
TN:         0     0     1     1     2     3     4     4     5     5     5
FN:         0     1     1     2     2     2     2     3     3     4     5
TPR:        1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR:        1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

ROC curve: plot TPR against FPR at each threshold.
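The same sweep in Python, reproducing the slide's (FPR, TPR) points at each distinct threshold:

```python
def roc_points(scores, labels):
    """(FPR, TPR) at each threshold, predicting '+' when score >= threshold."""
    pos = sum(1 for l in labels if l == "+")
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores) | {1.0}, reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == "+")
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == "-")
        points.append((fp / neg, tp / pos))
    return points

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]
print(roc_points(scores, labels))
# [(0.0, 0.0), (0.0, 0.2), (0.0, 0.4), (0.2, 0.4), (0.6, 0.6),
#  (0.8, 0.6), (0.8, 0.8), (1.0, 0.8), (1.0, 1.0)]
```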

104 ROC for one Classifier Good separation between the classes, convex curve.

105 ROC for one Classifier Reasonable separation between the classes, mostly convex.

106 ROC for one Classifier Fairly poor separation between the classes, mostly convex.

107 ROC for one Classifier Poor separation between the classes, large and small concavities.

108 ROC for one Classifier Random performance.

109 AUC for ROC curves. (Figure: four ROC plots of True Positive Rate vs. False Positive Rate, with AUC = 100%, 50%, 90%, and 65%.)

110 Interpretation of AUC. AUC can be interpreted as the probability that the test result from a randomly chosen diseased individual is more indicative of disease than that from a randomly chosen nondiseased individual: P(X_i > X_j | D_i = 1, D_j = 0). So we can think of it as a nonparametric distance between disease and nondisease test results.

111 Lab7 Decision Tree with 10-Fold Cross Validation

112 Validation

113

114 Lab8 Decision Tree with Parameters Looping

115 You can iterate everything

116 You can keep log in tabular format

117 Set Heart_attack as a label

118

119

120

121 Run

122 Module 10 Regression

123 Regression. For classification the output is nominal; in regression the output is continuous (function approximation). Many models could be used; the simplest is linear regression: fit the data with the best hyper-plane which "goes through" the points. For each point, the difference between the predicted value and the actual observation is the residue.

124 Simple Linear Regression. For now, assume just one (input) independent variable x and one (output) dependent variable y. Multiple linear regression assumes an input vector x; multivariate linear regression assumes an output vector y. We will "fit" the points with a line (i.e. hyper-plane). Which line should we use? Choose an objective function. For simple linear regression we choose sum squared error (SSE): sum_i (predicted_i - actual_i)^2 = sum_i (residue_i)^2. Thus, find the line which minimizes the sum of the squared residues (i.e. least squares).

125 How do we "learn" parameters? For the 2-d problem (a line) there are coefficients for the bias and the independent variable (y-intercept and slope): Y = b0 + b1*X. To find the values of the coefficients which minimize the objective function, we take the partial derivatives of the objective function (SSE) with respect to the coefficients, set these to 0, and solve:
b1 = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
b0 = (sum(y) - b1*sum(x)) / n
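These closed-form estimates in a few lines of Python (the toy data is hypothetical):

```python
def fit_line(xs, ys):
    """Least-squares slope b1 and intercept b0 for simple linear regression."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b0 = (sy - b1 * sx) / n
    return b0, b1

# Points lying near y = 1 + 2x
b0, b1 = fit_line([0, 1, 2, 3], [1.1, 2.9, 5.2, 6.9])
print(b0, b1)  # ~1.07 and ~1.97
```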

126 Multiple Linear Regression: Y = b0 + b1*X1 + ... + bn*Xn. There is a closed form for finding multiple linear regression weights which requires matrix inversion, etc. There are also iterative techniques to find the weights; one is the delta rule. For regression we use an output node which is not thresholded (it just does a linear sum), and iteratively apply the delta rule; thus the net is the output:
delta_w_i = c * (t - net) * x_i
where c is the learning rate, t the target, net the linear output, and x_i the input for that weight. The delta rule updates towards the objective of minimizing the SSE, thus solving multiple linear regression. There are many other regression approaches that give different results by trying to better handle outliers and other statistical anomalies.

127-130 SSE and Linear Regression. SSE chooses to square the difference of the predicted vs. actual. Why square? We don't want residues to cancel each other. We could use absolute or other distances to solve the problem: sum_i |predicted_i - actual_i| (L1 vs. L2). SSE leads to a parabolic error surface, which is good for gradient descent. Which line would least squares choose? There is always one best fit. Note that the squared error causes the model to be more highly influenced by outliers, though it is the best fit assuming Gaussian noise error from the true surface. (Figures: candidate lines through sample points with residues 7; 5, 7; 5, 3, 7, 3.)

131 SSE and Linear Regression: Generalization. In generalization, all x values map to a y value on the chosen regression line. (Figure: the fitted line over the sample points.)

132 Non-Linear Tasks. Linear regression will not generalize well to the task below; it needs a non-linear surface. We could do a feature pre-process as with the quadric machine; for example, we could use an arbitrary polynomial in x: Y = b0 + b1*X + b2*X^2 + ... + bn*X^n. Thus it is still linear in the coefficients and can be solved with the delta rule, etc. What order polynomial should we use? Overfit issues occur, as we'll discuss later.

133 Evaluating numeric prediction. Same strategies: independent test set, cross-validation, significance tests, etc. Difference: error measures. Actual target values: a1, a2, ..., an. Predicted target values: p1, p2, ..., pn. Most popular measure: mean-squared error, ((p1 - a1)^2 + ... + (pn - an)^2) / n. Easy to manipulate mathematically.

134 Other measures. The root mean-squared error: sqrt( ((p1 - a1)^2 + ... + (pn - an)^2) / n ). The mean absolute error is less sensitive to outliers than the mean-squared error: (|p1 - a1| + ... + |pn - an|) / n. Sometimes relative error values are more appropriate (e.g. 10% for an error of 50 when predicting 500).

135 Improvement on the mean. How much does the scheme improve on simply predicting the average? The relative squared error is ((p1 - a1)^2 + ... + (pn - an)^2) / ((a1 - abar)^2 + ... + (an - abar)^2), where abar is the average of the actual values. The relative absolute error is (|p1 - a1| + ... + |pn - an|) / (|a1 - abar| + ... + |an - abar|).

136 Correlation coefficient. Measures the statistical correlation between the predicted values and the actual values. Scale independent, between -1 and +1. Good performance leads to large values!
r = S_PA / sqrt(S_P * S_A), where
S_PA = sum_i (p_i - pbar)(a_i - abar) / (n - 1)
S_P  = sum_i (p_i - pbar)^2 / (n - 1)
S_A  = sum_i (a_i - abar)^2 / (n - 1)
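These numeric-prediction measures implemented directly (a small sketch; the example predictions are hypothetical):

```python
import math

def regression_errors(preds, actuals):
    """MSE, RMSE, MAE, relative errors, and correlation coefficient."""
    n = len(actuals)
    a_bar = sum(actuals) / n
    p_bar = sum(preds) / n
    mse = sum((p - a) ** 2 for p, a in zip(preds, actuals)) / n
    mae = sum(abs(p - a) for p, a in zip(preds, actuals)) / n
    rse = mse * n / sum((a - a_bar) ** 2 for a in actuals)
    rae = mae * n / sum(abs(a - a_bar) for a in actuals)
    s_pa = sum((p - p_bar) * (a - a_bar)
               for p, a in zip(preds, actuals)) / (n - 1)
    s_p = sum((p - p_bar) ** 2 for p in preds) / (n - 1)
    s_a = sum((a - a_bar) ** 2 for a in actuals) / (n - 1)
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae,
            "rel_sq_err": rse, "rel_abs_err": rae,
            "corr": s_pa / math.sqrt(s_p * s_a)}

print(regression_errors([1.0, 2.1, 2.9, 4.2], [1, 2, 3, 4]))
```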

137 Correlation coefficient

138 Which measure? Best to look at all of them; often it doesn't matter. Example:

                           A      B      C      D
Root mean-squared error    67.8   91.7   63.3   57.4
Mean absolute error        41.3   38.5   33.4   29.2
Root rel. squared error    42.2%  57.2%  39.4%  35.8%
Relative absolute error    43.1%  40.1%  34.8%  30.4%
Correlation coefficient    0.88   0.88   0.89   0.91

D is best, C second-best; A and B are arguable.

139 Lab9 Linear Regression

140 Retrieve Data08_wine

141

142

143 Module 11 Advanced Classification Techniques

144 Neural Networks. What is a neural network? A biologically motivated approach to machine learning, with similarity to biological networks. The fundamental processing element of a neural network is the neuron, which: 1. receives inputs from other sources; 2. combines them in some way; 3. performs a generally nonlinear operation on the result; 4. outputs the final result.

145 Neural Network Neural Network is a set of connected INPUT/OUTPUT UNITS, where each connection has a WEIGHT associated with it. Neural Network learning is also called CONNECTIONIST learning due to the connections between units. It is a case of SUPERVISED, INDUCTIVE or CLASSIFICATION learning.

146 Neural Network Neural Network learns by adjusting the weights so as to be able to correctly classify the training data and hence, after testing phase, to classify unknown data. Neural Network needs long time for training. Neural Network has a high tolerance to noisy and incomplete data

147 One Neuron as a Network. Here x1 and x2 are normalized attribute values of the data, and y is the output of the neuron, i.e. the class label. The values x1 and x2, multiplied by weights w1 and w2, are input to the neuron. Given w1 = 0.5 and w2 = 0.5, and say x1 = 0.3 and x2 = 0.8, the weighted sum is: sum = w1*x1 + w2*x2 = 0.5*0.3 + 0.5*0.8 = 0.55

148 Bias of a Neuron. We need a bias value added to the weighted sum sum_i w_i*x_i so that we can shift the function away from the origin: v = sum_i w_i*x_i + b, where b is the bias. (Figure: parallel decision lines x1 - x2 = -1, x1 - x2 = 0, x1 - x2 = 1.)

149 Bias as extra input. The bias can be treated as the weight w0 on a constant extra input x0 = +1, so the summing function becomes v = sum_{j=0..m} w_j*x_j with w0 = b; v is then passed through the activation function phi(.) to produce the output class y.

150 Neuron with Activation. The neuron is the basic information processing unit of a NN. It consists of: 1. a set of links describing the neuron inputs, with weights W1, W2, ..., Wm; 2. an adder function (linear combiner) computing the weighted sum of the inputs, u = sum_{j=1..m} w_j*x_j; 3. an activation function phi for limiting the amplitude of the neuron output, y = phi(u + b).

151 Activation Function
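One neuron end to end in Python, using the logistic sigmoid as an example activation (the choice of sigmoid here is an assumption; the slide's figure shows several options):

```python
import math

def sigmoid(u):
    """Logistic activation: squashes the net input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))

def neuron(x, w, b, phi=sigmoid):
    """Weighted sum of inputs plus bias, passed through activation phi."""
    u = sum(wi * xi for wi, xi in zip(w, x))
    return phi(u + b)

# The slide 147 example: w = (0.5, 0.5), x = (0.3, 0.8), weighted sum 0.55
print(neuron([0.3, 0.8], [0.5, 0.5], b=0.0))  # sigmoid(0.55) ~ 0.634
```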

152 Multi-layer

153 Why We Need Multi Layer? Linearly separable functions (such as OR and AND on the next slides) can be computed by a single neuron; linearly inseparable functions (such as XOR) cannot. Solution?

154 Why We Need Multi Layer? OR function: a single output neuron with weights w1 = 1, w2 = 1 on inputs x1, x2 and threshold test f: sum >= 1?

155 Why We Need Multi Layer? AND function: a single output neuron with weights w1 = 0.5, w2 = 0.5 on inputs x1, x2 and threshold test f: sum >= 1?

156 Why We Need Multi Layer? XOR function: needs a hidden layer. Two hidden threshold units (each f: sum >= 1?) read x1 and x2 with weights 1 and -1, and an output unit (f: sum >= 1?) combines the two hidden outputs with weights 1.

157 A Multilayer Feed-Forward Neural Network. Input nodes receive the input record x_i; hidden nodes O_j and output nodes O_k are connected by weights w_ij and w_jk; the output nodes produce the output class. The network is fully connected.

158 Classification by Backpropagation. Backpropagation learns by iteratively processing a set of training data (samples). For each sample, the weights are modified to minimize the error between the network's classification and the actual classification.

159 Steps in Backpropagation Algorithm. STEP ONE: initialize the weights and biases. The weights in the network are initialized to random numbers from the interval [-1, 1]. Each unit has a bias associated with it; the biases are similarly initialized to random numbers from the interval [-1, 1]. STEP TWO: feed the training sample.

160 Steps in Backpropagation Algorithm (cont.). STEP THREE: propagate the inputs forward; we compute the net input and output of each unit in the hidden and output layers. STEP FOUR: backpropagate the error. STEP FIVE: update weights and biases to reflect the propagated errors. STEP SIX: terminating conditions.

161 Terminating Conditions. Training stops when: all weight changes (delta w_ij) in the previous epoch are below some threshold, or the percentage of samples misclassified in the previous epoch is below some threshold, or a pre-specified number of epochs has expired. In practice, several hundreds of thousands of epochs may be required before the weights converge.
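A compact sketch of training such a network in Python with scikit-learn (an assumption for illustration; Lab10 below builds the equivalent in RapidMiner with one hidden layer of size 6):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale inputs: gradient-based training is sensitive to attribute ranges
scaler = StandardScaler().fit(X_train)

# One hidden layer of 6 units, trained by backpropagating the error
mlp = MLPClassifier(hidden_layer_sizes=(6,), max_iter=2000, random_state=0)
mlp.fit(scaler.transform(X_train), y_train)
print(mlp.score(scaler.transform(X_test), y_test))
```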

162 Artificial Neural Network

163 Artificial Neural Network

164 Training Process of the MLP. Training continues until the RMS error is minimized. (Figure: error surface over the N-dimensional weight space W, with local minima and a global minimum.)

165 Backprop requires many epochs to converge. Some ideas to overcome this: Stochastic learning: update the weights after each training example (faster convergence). Momentum: add a fraction of the previous update to the current update (faster convergence).

166 Advantages of ANN Able to deal with (identify/model) highly nonlinear relationships Not prone to restricting normality and/or independence assumptions Can handle variety of problem types Usually provides better results (prediction and/or clustering) compared to its statistical counterparts Handles both numerical and categorical variables (transformation needed!)

167 Disadvantages of ANN They are deemed to be black-box solutions, lacking expandability It is hard to find optimal values for large number of network parameters Optimal design is still an art: requires expertise and extensive experimentation It is hard to handle large number of variables (especially the rich nominal attributes) Training may take a long time for large datasets; which may require case sampling

168 Remember.. ANNs perform well, generally better with larger number of hidden units More hidden units generally produce lower error Determining network topology is difficult Choosing single learning rate impossible Difficult to reduce training time by altering the network topology or learning parameters NN(Subset) often produce better results

169 Lab10 Neural Network with 10-fold Cross Validation

170 Retrieve Data08_Car

171 Input hidden layer Layer Name: H1 Size: 6

172 Random Tree. The Random Tree operator learns decision trees from both nominal and numerical data. Decision trees are powerful classification methods which can be easily understood. The operator works similarly to Quinlan's C4.5 or CART, but it selects a random subset of attributes before it is applied; the size of the subset is specified by the subset ratio parameter.

173

174 Random Forest The Random Forest operator generates a set of random trees. The random trees are generated in exactly the same way as the Random Tree operator generates a tree.

175 Gradient Boosted Trees

176

177 Lab11 All about trees

178

179 Try evaluating them

180

181 Optimization To find the optimal values of the selected parameters for the Operators in its subprocess.

182 Optimization. Optimize Parameters (Grid): executes the subprocess for all combinations of selected values of the parameters and then delivers the optimal parameter values. Optimize Parameters (Quadratic): finds the optimal values for a set of parameters using a quadratic interaction model. Optimize Parameters (Evolutionary): finds the optimal values for a set of parameters using an evolutionary approach, which is often more appropriate than a grid search (as in Optimize Parameters (Grid)) or a greedy search (as in Optimize Parameters (Quadratic)) and leads to better results.

183 Lab12 Parameter Optimization

184

185

186 Lab13 Prescriptive Analytics

187 class name = computer

188 Module 12 Model Integration and Results Optimization

189 Ensemble Methods Construct a set of classifiers from the training data Predict class label of previously unseen records by aggregating predictions made by multiple classifiers

190 General Idea. Step 1: create multiple data sets D1, D2, ..., Dt-1, Dt from the original training data D. Step 2: build a classifier C1, C2, ..., Ct on each of them. Step 3: combine the classifiers into C*.

191 Why does it work? Suppose there are 25 base classifiers, each with error rate eps = 0.35, and assume the classifiers are independent. The probability that the ensemble (majority vote) makes a wrong prediction is the probability that at least 13 of the 25 are wrong:
sum_{i=13..25} C(25, i) * eps^i * (1 - eps)^(25-i) = 0.06
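Verifying that figure directly in Python:

```python
from math import comb

eps = 0.35  # error rate of each base classifier

# Probability that 13 or more of 25 independent classifiers are wrong,
# i.e. that the majority vote of the ensemble is wrong
p_wrong = sum(comb(25, i) * eps**i * (1 - eps)**(25 - i)
              for i in range(13, 26))
print(p_wrong)  # ~0.06
```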

192 Examples of Ensemble Methods How to generate an ensemble of classifiers? Bagging Boosting

193 Bagging: sampling with replacement.

Original Data:      1  2  3  4  5  6   7   8  9  10
Bagging (Round 1):  7  8  10 8  2  5  10  10  5  9
Bagging (Round 2):  1  4  9  1  2  3   2   7  3  2
Bagging (Round 3):  1  8  5  10 5  5   9   6  3  7

Build a classifier on each bootstrap sample. Each record has probability 1 - (1 - 1/n)^n (about 0.632 for large n) of appearing in a given bootstrap sample.
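A minimal bootstrap-sampling sketch in Python (round contents are random, so they will differ from the slide's illustration):

```python
import random

def bootstrap_sample(records, rng):
    """One bagging round: draw n records with replacement from n records."""
    n = len(records)
    return [records[rng.randrange(n)] for _ in range(n)]

rng = random.Random(1)
original = list(range(1, 11))
for r in range(3):
    print(f"Round {r + 1}:", bootstrap_sample(original, rng))

# A record appears in a given round with probability 1 - (1 - 1/n)^n:
n = 10_000
print(1 - (1 - 1 / n) ** n)  # ~0.632
```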

194 Boosting. An iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records. Initially, all N records are assigned equal weights; unlike bagging, the weights may change at the end of each boosting round.

195 Boosting. Records that are wrongly classified will have their weights increased; records that are classified correctly will have their weights decreased.

Original Data:       1  2  3  4   5  6  7   8  9  10
Boosting (Round 1):  7  3  2  8   7  9  4  10  6  3
Boosting (Round 2):  5  4  9  4   2  5  1   7  4  2
Boosting (Round 3):  4  4  8  10  4  5  4   6  3  4

Example 4 is hard to classify: its weight is increased, so it is more likely to be chosen again in subsequent rounds.

196 RapidMiner: Bagging and Boosting. Put multiple classifiers inside this operator. Put one classifier inside one of these operators.

197 Lab14 Voting

198

199

200 Some other useful operators: Convert Trees to Rules; real-time output simulator; group any kinds of models into a single model applied sequentially; create mathematical formulas from such a model.